require("pacman")
p_load(tidyverse, stargazer, lme4, gt, gtExtras, skimr, kableExtra, merTools, tidyr, tidytext, broom.mixed)
load("D:\\OneDrive - HKUST Connect\\practice\\mixed models\\gpa.Rdata")In terms of structure, data might have one or multiple sources of clustering, and that clustering maybe hierarchical, such that clusters are nested within ohter cluster. For example, repeated observations nested within students, students nested within schools, schools nested within districts.
require("pacman")
p_load(tidyverse, stargazer, lme4, gt, gtExtras, skimr, kableExtra, merTools, tidyr, tidytext)
load("D:\\OneDrive - HKUST Connect\\practice\\mixed models\\gpa.Rdata")gpaCategorical data descriptive statistics:
gpa %>%
dplyr::select(where(is.factor) | where(is.character)) %>%
skim() %>%
as_tibble() %>%
dplyr::select(-factor.top_counts) %>%
gt() %>%
gt_highlight_rows(rows = seq(1, 6, 2),
fill = "lightgrey") %>%
gt_theme_nytimes()
| skim_type | skim_variable | n_missing | complete_rate | factor.ordered | factor.n_unique |
|---|---|---|---|---|---|
| factor | student | 0 | 1.00 | FALSE | 200 |
| factor | occas | 0 | 1.00 | FALSE | 6 |
| factor | job | 0 | 1.00 | FALSE | 3 |
| factor | sex | 0 | 1.00 | FALSE | 2 |
| factor | admitted | 384 | 0.68 | FALSE | 2 |
| factor | semester | 0 | 1.00 | FALSE | 2 |
Numeric data descriptive statistics:
gpa %>%
skim() %>%
yank("numeric") %>%
gt() %>%
gt_highlight_rows(rows = c(1, 3),
fill = "lightgrey") %>%
gt_theme_nytimes()
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| gpa | 0 | 1 | 2.8650 | 0.3930484 | 1.7 | 2.6 | 2.8 | 3.1 | 4 | ▁▅▇▅▁ |
| highgpa | 0 | 1 | 2.9875 | 0.5948854 | 2.0 | 2.5 | 2.9 | 3.5 | 4 | ▇▇▆▇▆ |
| year | 0 | 1 | 2.0000 | 0.8168370 | 1.0 | 1.0 | 2.0 | 3.0 | 3 | ▇▁▇▁▇ |
| occasion | 0 | 1 | 2.5000 | 1.7085372 | 0.0 | 1.0 | 2.5 | 4.0 | 5 | ▇▃▃▃▃ |
It always helps to look before we leap.
gpa %>%
ggplot(aes(occasion, gpa)) +
geom_smooth(method = "lm", formula = "y ~ x", se = FALSE, color = "#FF0000") +
geom_point() +
geom_line(aes(group = student), linetype = "dotted", color = "#5E5E5E") +
theme_bw() +
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"))
gpa %>%
lm(gpa ~ occasion, data = .) %>%
stargazer(type = "text", single.row = TRUE, style = "ajps")
---------------------------------------------
gpa
---------------------------------------------
occasion 0.106*** (0.006)
Constant 2.599*** (0.018)
N 1200
R-squared 0.214
Adj. R-squared 0.213
Residual Std. Error 0.349 (df = 1198)
F Statistic 325.339*** (df = 1; 1198)
---------------------------------------------
***p < .01; **p < .05; *p < .1
The above tells us that starting out, i.e. when occasion is zero, the average GPA, denoted by the intercept, is 2.6. Additionally, as we move from semester to semester, we can expect GPA to increase by about 0.11 points. However, this model ignores the clustering of repeated observations within students. A side effect of ignoring that dependence is that our standard errors are biased, and thus claims about statistical significance based on them would be off.
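That claim can be checked directly with a small simulation (not the gpa data; the cluster structure and variance values below are made up to mimic students measured over six occasions). In this balanced design, lm()'s naive standard errors are biased in both directions: too small for the intercept, too large for the within-cluster slope.

```r
# 200 clusters of 6 observations each, with a cluster-specific intercept
set.seed(123)
n_clusters <- 200
n_per      <- 6
x <- rep(0:(n_per - 1), times = n_clusters)

simulate_fit <- function() {
  cluster_eff <- rep(rnorm(n_clusters, sd = 0.5), each = n_per)
  y <- 2.6 + 0.1 * x + cluster_eff + rnorm(n_clusters * n_per, sd = 0.25)
  lm(y ~ x)
}

# Naive standard errors from a single lm() fit
fit <- simulate_fit()
naive_se <- summary(fit)$coefficients[, "Std. Error"]

# Empirical sd of the estimates across 500 replications: the "true"
# sampling variability under this clustered data-generating process
ests <- replicate(500, coef(simulate_fit()))
true_sd <- apply(ests, 1, sd)

round(rbind(naive_se, true_sd), 4)
```

The intercept's naive SE understates the truth (the effective sample size for cluster-constant quantities is the number of clusters, not observations), while the slope's naive SE overstates it, because the within-cluster slope is unaffected by the cluster effects but lm() folds the cluster variance into its residual variance. The same pattern shows up below when we compare the lm and lmer tables for the gpa data.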
The code for the mixed model looks just like what we used for the regression with lm, but with an additional component specifying the group, i.e. the student effect. The \((1 | student)\) means that we are allowing the intercept, represented by \(1\), to vary by student.
gpa %>%
lmer(gpa ~ occasion + (1 | student), data = .) -> gpa_mixed
gpa_mixed %>%
stargazer(type = "text", single.row = TRUE, style = "ajps")
-------------------------------
gpa
-------------------------------
occasion 0.106*** (0.004)
Constant 2.599*** (0.022)
N 1200
Log Likelihood -204.446
AIC 416.893
BIC 437.253
-------------------------------
***p < .01; **p < .05; *p < .1
The confidence intervals are as follows:
gpa_mixed %>%
confint()
                  2.5 %    97.5 %
.sig01 0.22517423 0.2824604
.sigma 0.23071113 0.2518510
(Intercept) 2.55665145 2.6417771
occasion 0.09832589 0.1143027
We did not allow the effect of occasion to vary, so it is a constant, i.e. a fixed effect, for all students.
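If we did want the effect of occasion to vary by student, we would add it to the random part of the formula. A minimal sketch on simulated data (the gpa file is local, so the variable names and variance values here are illustrative only):

```r
library(lme4)

# Simulated data shaped like gpa: 100 students, 6 occasions each, with
# student-specific intercepts AND student-specific slopes
set.seed(42)
n_students <- 100
occasion <- rep(0:5, times = n_students)
student  <- factor(rep(seq_len(n_students), each = 6))
int_i    <- rnorm(n_students, sd = 0.30)[as.integer(student)]
slope_i  <- rnorm(n_students, sd = 0.10)[as.integer(student)]
gpa_sim  <- 2.6 + int_i + (0.1 + slope_i) * occasion +
  rnorm(length(occasion), sd = 0.20)
d <- data.frame(gpa_sim, occasion, student)

# (1 + occasion | student): both the intercept and the occasion slope
# vary by student, and the two random effects may be correlated
m_slope <- lmer(gpa_sim ~ occasion + (1 + occasion | student), data = d)
VarCorr(m_slope)  # intercept sd, slope sd, and their correlation
```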
gpa_mixed %>%
REsim() %>%
plotREsim()
Mixed model as a multilevel model:
\[\begin{align*} gpa &= b_{int\_student} + b_{occ} \cdot occasion + \epsilon \\ b_{int\_student} &= b_{intercept} + student_{effect} \end{align*}\]
If we add a student level covariate, e.g. sex, to the model, we then have the following.
\[\begin{align*} gpa &= b_{intercept} + b_{occ} \cdot occasion + b_{sex} \cdot sex + (student_{effect} + \epsilon) \end{align*}\]
So, adding cluster level covariates does not have any unusual effect on how we think about the model. Note also that we can create cluster level covariates as group means or some other summary of the observation level variables. This is especially common when the clusters represent geographical units and the observations are people. For example, we might have income as a person level covariate, and use the median income to represent the overall wealth of a geographical region.
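Constructing such a cluster level summary is a one-liner with dplyr. A sketch on made-up data (the region names and incomes are hypothetical):

```r
library(dplyr)

# Hypothetical person-level data nested within regions
people <- data.frame(
  region = rep(c("A", "B", "C"), each = 4),
  income = c(30, 45, 50, 38, 80, 95, 70, 88, 55, 60, 52, 58)
)

# Derive a cluster level covariate (median income per region) from the
# person level variable; every person in a region gets the same value
people <- people %>%
  group_by(region) %>%
  mutate(region_med_income = median(income)) %>%
  ungroup()
```

The resulting region_med_income column varies only between regions, so it enters the model exactly like sex did above: as an ordinary fixed-effect predictor.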
data("sleepstudy")
glimpse(sleepstudy)
Rows: 180
Columns: 3
$ Reaction <dbl> 249.5600, 258.7047, 250.8006, 321.4398, 356.8519, 414.6901, 3…
$ Days <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0…
$ Subject <fct> 308, 308, 308, 308, 308, 308, 308, 308, 308, 308, 309, 309, 3…
Run a regression with Reaction as the target variable and Days as the predictor.
sleepstudy %>%
lm(Reaction ~ Days, data = .) %>%
stargazer(type = "text", no.space = T, align = T, style = "ajps")
-------------------------------------------
Reaction
-------------------------------------------
Days 10.467***
(1.238)
Constant 251.405***
(6.610)
N 180
R-squared 0.286
Adj. R-squared 0.282
Residual Std. Error 47.715 (df = 178)
F Statistic 71.464*** (df = 1; 178)
-------------------------------------------
***p < .01; **p < .05; *p < .1
Run a mixed model with a random intercept for subject.
sleepstudy %>%
lmer(Reaction ~ Days + (1 | Subject), data = .) %>%
broom.mixed::tidy()
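Beyond the coefficient table, the variance components themselves are often of interest. For instance, the intraclass correlation, the share of the unexplained variance attributable to stable differences between subjects, can be computed from VarCorr():

```r
library(lme4)

data("sleepstudy")
m <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# Variance components: between-subject (intercept) and residual
vc <- as.data.frame(VarCorr(m))
between <- vc$vcov[vc$grp == "Subject"]
within  <- vc$vcov[vc$grp == "Residual"]

# Intraclass correlation: proportion of residual variation due to subjects
icc <- between / (between + within)
round(icc, 2)
```

For this model the ICC comes out around 0.59, i.e. most of the variation left over after accounting for Days sits between subjects rather than within them, which is exactly why a random intercept is warranted here.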